-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Extract metric handling from the desired balancer reconciler #120085
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract metric handling from the desired balancer reconciler #120085
Conversation
The DesiredBalanceReconciler applies desired balance to the routing table cluster metadata. This patch extract logic from the reconciler that pushes updates to the desired balance metrics. This removes extra parameters passed into the DesiredBalanceReconciler, and enables expanding the metrics that are collected without passing more objects into the reconciler class. Relates ES-10341
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
| long index, | ||
| RoutingAllocation routingAllocation, | ||
| List<ShardRouting> ignoredShards, | ||
| BalancingRoundStats.Builder clusterAllocationStatsBuilder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like it's a bit confusing how there's a stats builder on the input, but we only use it for the eager reconciliation (assuming I've understood the code right). Could we instead just create one specifically for the eager reconciliation to make it clear it's local to that process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand. What part are you referring to as eager reconciliation? Is it this balance() method?
My intention with passing around a stats data structure is to collect additional statistics throughout the balancing round -- for example, I'll want balancing round start and end times, so I'll need stats tracking as early as this line before computation begins. A balancing round begins with a DesiredBalanceInput submitted to the ContinuousComputation object, which queues requests for a balancing round: the latest DesiredBalanceInput submission will be run async.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean in allocate() we call reconcile(...) with desiredBalanceInput.clusterAllocationStatsBuilder() but then in the continuous computation we create separate StatsBuilders to pass to submitReconcileTask, even though desiredBalanceInput is in scope.
I just wonder whether the statsBuilder belongs in desiredBalanceInput? but I find the code a bit hard to follow so maybe there's an obvious reason for it being there.
I think I'm struggling to understand the lifecycle of the builder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh. I think you found something buggy. So allocate will queue a DesiredBalanceInput for computation, which gets picked up asynchronously. But then reconcile() is run immediately and updates the metrics. Computation happens eventually.
I thought I was being thorough by passing the stats builder everywhere, but reconciliation always runs whereas computation occasionally runs (queues, so an input will get skipped if there's a new input). And then computation will normally lead to reconciliation being called.
Okay... So allocate() needs two separate StatsBuilders, one for immediate reconciliation, and the other passed to computation but ultimately reconciliation. Or alternatively we skip instrumenting computation and create fresh StatsBuilders for each reconciliation event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there will be some use of the StatsBuilder in the computation phase (is that what // TODO: this will be expanded upon shortly in ES-10341. refers to?)
If so, is is as simple as just
- Removing StatsBuilder from the DesiredBalanceInput
- Creating one locally to pass to
reconcile(...)inallocate(...) - Using the existing local
StatsBuilderinprocessInput
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I've updated this patch to be purely refactor work. It doesn't lead into new work anymore. I think it's cleaner without pushing desired metrics objects into the Reconciler to handle: this cleans up the constructor and purpose of the Reconciler.
|
I find it hard to judge how useful the refactoring decisions here and in #119916 are without actually seeing what metric are going to be collected. IMO, it would good to see a draft PR (potentially w/o tests is also ok) which shows concretely what we're collecting and then work out what are reasonable pieces to separate out, if that ends up being too large to review. My experience adding some metrics here was that while trying to plug things in, you'd stumble upon different twists and wrinkles that might impact the structure. (not sure how much has been discussed, but I find it hard to asses these refactoring PRs). |
Sure, here's a WIP PR for the summed changes. I have some notes on what to look at in that PR. Let me know what you think.
Yes, the code is definitely tricky. It takes a lot of thinking to sort out what anything is right now. I haven't really discussed the code implementation with anyone so far, only what I've tried to explain in the PRs. The design ticket, ES-10260, explains the high-level, in detail, to see where I'm headed. |
|
Thanks, I had a look at the WIP PR. I'm a bit lost following the changes, probably because: 1) the PR seems to be very general even though the specific metrics are still TODO. Any chance we could start small and gradually extend it? It is hard to follow the changes. 2) There is a lot of noise that is not really a core change, e.g. adding new comments to existing code and renames, it makes it hard to focus on what the actual change is. Just two suggestions that would make reviewing for me personally easier. As for the way stats are collected, I think I agree that passing around builder like that is a bit confusing and probably we shouldn't add it to the input like that. Do the gathered metrics need to be all centralized like that? |
|
I've realized that my understanding of the code was incorrect. Reconciliation only applies a handful of changes from the current desired balance computation, until it reaches throttling saturation (e.g. each node can only participate in X concurrent shard recoveries). So, as an example, the reconciler might stop assigning shards in So I'll need to instrument the computation, not reconciliation. I'm investigating if I can compare old to new desired balance here to get the stats. I think refactoring the reconciler to pull out stats reporting still makes sense, makes things more modular, but I'll see how it looks. |
|
Slowly making progress here.. I've filed ES-10581 as an extra twist. |
DiannaHohensee
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alrighty, I've refactored my change to only extract the logic out of the Reconciler, keeping all the desired balance metric update logic in the Allocator.
| long index, | ||
| RoutingAllocation routingAllocation, | ||
| List<ShardRouting> ignoredShards, | ||
| BalancingRoundStats.Builder clusterAllocationStatsBuilder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I've updated this patch to be purely refactor work. It doesn't lead into new work anymore. I think it's cleaner without pushing desired metrics objects into the Reconciler to handle: this cleans up the constructor and purpose of the Reconciler.
|
I'm closing this -- since there hasn't been much activity and it's a bit messy from the pivot -- and replacing it with #121771. |
The DesiredBalanceReconciler is responsible for applying updates to
the cluster states that reflect shard allocation changes towards a
DesiredBalance. It isn't the Reconciler's responsibility to handle
pushing APM metrics. This patch cleans up the Reconciler constructor
and logic by extracting metric handling, modularizing metric updates
in the Allocator level of the code instead of being split across the
two components.
I want to expand upon the metric collection, so I'm yanking it out of the Reconciler. Otherwise I'll have to push more objects into the Reconciler to create all the stats there, which isn't really the Reconciler's business. Precursor to plugging in the work started over here #119916: once they're both in, I can start plug in the new stats.This is purely a refactor now.